{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# VIMER-StrucTexT 1.0\n",
    "\n",
    "## 1.模型简介\n",
    "\n",
    "随着OCR技术在产业应用的快速发展，现实场景对OCR提出新的需求：从感知走向认知。结构化信息提取逐渐成为OCR产业应用的核心技术之一，旨在快速且准确地分析卡证、票据、档案图像等富视觉数据中的结构化文字信息，并对关键数据进行提取。\n",
    "因此，百度OCR团队提出了一个联合字符级别和字段级别文本多模态特征增强的预训练大模型VIMER-StrucTexT，支持文档、票据、卡证等富视觉图像中的OCR结构化信息提取，并应用于实体分类和实体链接两大任务类型。VIMER-StrucTexT通过实现字符级和字段级表达灵活输出的框架，能够根据下游任务的特性，选择合理的建模粒度去支持上述的结构化任务。除此之外，针对文档票据等富视觉文本图像，充分利用大数据优势，创新性引入字段长度预测和字段方位预测等自监督预训练方式，VIMER-StrucTexT拥有更丰富的多模态特征表达。\n",
    "\n",
    "![structext_arch](./doc/structext_arch.png#pic_center)\n",
    "<center><b>模型结构</b></center>\n",
    "\n",
    "## 2.模型效果\n",
    "\n",
    "我们在三个文档结构化理解任务上微调了VIMER-StrucTexT，分别是字符级别实体分类(**T-ELB**)、字段级别实体分类(**S-ELB**)、字段级别实体连接(**S-ELK**)。**注意：**后文中的实体分类任务的效果指标均为**字段级别 F1 score**，该指标计算相关字段类型的Macro F1 score。\n",
    "\n",
    "### 2.1 字符级别实体分类\n",
    "\n",
    "#### 2.1.1 数据集\n",
    "\n",
    "- [EPHOIE](https://github.com/HCIILAB/EPHOIE)主要来源于扫描版中文试卷文档。\n",
    "该数据集包括10种字段类型，每种字段均为字符级别的标注，也就是说一段文本行中不同的单字符可能属于不同的字段类型。字符级别实体分类包含的类型：学科、测试时间、姓名、学校、考试号、座位号、班级、学号、年级、分数。\n",
    "| 模型                          | **字段级别 F1 score**          |\n",
    "| :---------------------------- | :----------------------------: |\n",
    "| Base model      |            0.9884              |\n",
    "| Large model      |           0.9930              |\n",
    "\n",
    "### 2.2 字段级别实体分类\n",
    "\n",
    "#### 2.2.1 数据集\n",
    "\n",
    "- [SROIE](https://rrc.cvc.uab.es/?ch=13&com=introduction)是一个用于票据信息抽取的公开数据集，由ICDAR 2019 Chanllenge提供。它包含了626张训练票据数据以及347张测试票据数据，每张票据都包含以下四个预定义字段：`公司名, 日期, 地址, 总价`。\n",
    "- [FUNSD](https://guillaumejaume.github.io/FUNSD/)是一个用于表单理解的数据集，它包含199张真实的、完全标注的扫描版图片，类型包括市场报告、广告以及学术报告等，并分为149张训练集以及50张测试集。FUNSD数据集适用于多种类型的任务，我们专注于解决其中的字段级别实体分类以及字段级别实体连接任务。\n",
    "- [XFUND](https://github.com/doc-analysis/XFUND)是一个多语种表单理解数据集，它包含7种不同语种表单数据，并且全部用人工进行了键-值对形式的标注。其中每个语种的数据都包含了199张表单数据，并分为149张训练集以及50张测试集，我们测试了XFUND的中文子数据集。\n",
    "\n",
    "| 模型        | **SROIE**    | **FUNSD**      | **XFUND-ZH**    |\n",
    "| :-----------------------------| :----------------------------: | :----------------------------: | :----------------------------: |\n",
    "| Base model      |           0.9827               |           0.8483               |\n",
    "| Large model     |           0.9870               |           0.8756               |\n",
    "\n",
    "### 2.3 字段级别实体连接\n",
    "\n",
    "#### 2.3.1 数据集\n",
    "\n",
    "- [FUNSD](https://guillaumejaume.github.io/FUNSD)标注的连接格式为`(entity_from, entity_to)`，代表着该连接为一对“问题-答案”，模型主要任务为预测语义实体之间的连接关系。\n",
    "- [XFUND](https://github.com/doc-analysis/XFUND)的实验设置与FUNSD相同，我们在其中的中文子数据集测试了模型的效果。\n",
    "\n",
    "| 模型        | **SROIE**    | **FUNSD**      | **XFUND-ZH**    |\n",
    "| :-----------------------------| :----------------------------: | :----------------------------: | :----------------------------: |\n",
    "| Base model      |           0.7045               |           0.8009               |\n",
    "| Large model     |           0.7421               |           0.8681               |\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.快速体验\n",
    "\n",
    "### 3.1 安装PaddlePaddle\n",
    "\n",
    "本代码库基于`PaddlePaddle 2.1.0+`，你可参阅[paddlepaddle-quick](https://www.paddlepaddle.org.cn/install/quick)进行环境准备，或者使用pip进行安装："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip3 install paddlepaddle-gpu --upgrade -i https://mirror.baidu.com/pypi/simple"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 依赖库\n",
    "\n",
    "相关依赖库已在`requirements.txt`中列出，你可以使用以下命令行进行依赖库安装：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip3 install --upgrade -r requirements.txt -i https://mirror.baidu.com/pypi/simple"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 推理模型\n",
    "#### 3.3.1 EPHOIE数据集字符级别实体分类推理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "! python data/make_ephoie_data.py --config_file ./configs/(base/large)/labeling_ephoie.json --label_file examples/ephoie/test_list.txt --label_dir <ephoie_folder>/final_release_image_20201222/ --kvpair_dir <ephoie_folder>/final_release_kvpair_20201222/ --out_dir <ephoie_folder>/test_labels\n",
    "! python ./tools/eval_infer.py --config_file ./configs/(base/large)/labeling_ephoie.json --task_type labeling_token --label_path <ephoie_folder>/test_labels/ --image_path <ephoie_folder>/final_release_image_20201222/ --weights_path StrucTexT_ephoie_(base/large)_labeling.pdparams"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3.2 FUNSD数据集字段级别实体分类任务"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "! python data/make_funsd_data.py --config_file ./configs/base/labeling_funsd.json --label_dir <funsd_folder>/dataset/testing_data/annotations/ --out_dir <funsd_folder>/dataset/testing_data/test_labels\n",
    "! python ./tools/eval_infer.py --config_file ./configs/(base/large)/labeling_ephoie.json --task_type labeling_token --label_path <ephoie_folder>/test_labels/ --image_path <ephoie_folder>/final_release_image_20201222/ --weights_path StrucTexT_ephoie_(base/large)_labeling.pdparams"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3.3 FUNSD数据集字段级别实体连接任务"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {},
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "! python data/make_funsd_data.py --config_file ./configs/base/linking_funsd.json --label_dir <funsd_folder>/dataset/testing_data/annotations/ --out_dir <funsd_folder>/dataset/testing_data/test_labels\n",
    "! python ./tools/eval_infer.py --config_file ./configs/base/linking_funsd.json --task_type linking --label_path <funsd_folder>/dataset/testing_data/test_labels/ --image_path <funsd_folder>/dataset/testing_data/images/ --weights_path StrucTexT_funsd_base_linking.pdparams"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.产品应用\n",
    "\n",
    "以下可视化数据来源于StrucTexT的实际应用效果。*不同颜色代表不同的实体类别，实体之间的黑色连接线代表它们属于同一实体，橙色连接线代表实体之间存在连接关系。*\n",
    "- 购物小票\n",
    "![example_receipt](./doc/receipt_vis.png#pic_center)\n",
    "- 船票/车票\n",
    "![example_busticket](./doc/busticket_vis.png#pic_center)\n",
    "- 机打发票\n",
    "![example_print](./doc/print_vis.png#pic_center)\n",
    "- 更多相关信息与应用，请参考[Baidu OCR](https://ai.baidu.com/tech/ocr)开放平台。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.引用\n",
    "相关文献请引用：\n",
    "```\n",
    "@inproceedings{li2021structext,\n",
    "  title={StrucTexT: Structured Text Understanding with Multi-Modal Transformers},\n",
    "  author={Li, Yulin and Qian, Yuxi and Yu, Yuechen and Qin, Xiameng and Zhang, Chengquan and Liu, Yan and Yao, Kun and Han, Junyu and Liu, Jingtuo and Ding, Errui},\n",
    "  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},\n",
    "  pages={1912--1920},\n",
    "  year={2021}\n",
    "}\n",
    "```\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
